2025-05-30-12-06
A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems
Abstract
arXiv:2505.22814v1 Announce Type: new Abstract: Manufacturing environments are becoming more complex and unpredictable due to factors such as demand variations and shorter product lifespans. This complexity requires real-time decision-making and adaptation to disruptions. Traditional control approaches show limited responsiveness in dynamic industrial settings, which highlights the need for advanced control strategies capable of overcoming unforeseen challenges. Multi-agent systems address these challenges through decentralization of decision-making, enabling systems to respond dynamically to operational changes. However, current multi-agent systems encounter challenges related to real-time adaptation, context-aware decision-making, and the dynamic exploration of resource capabilities. Large language models offer a way to overcome these limitations through context-aware decision-making capabilities. This paper introduces a large language model-enabled control architecture for multi-agent manufacturing systems to dynamically explore resource capabilities in response to real-time disruptions. A simulation-based case study demonstrates that the proposed architecture improves system resilience and flexibility. The case study findings show improved throughput and efficient resource utilization compared to existing approaches.
摘要
制造环境因需求波动和产品生命周期缩短等因素正变得日益复杂且难以预测。这种复杂性要求系统具备实时决策能力及对突发干扰的适应性。传统控制方法在动态工业环境中的响应能力存在局限,凸显了对先进控制策略的需求。多智能体系统通过决策分散化应对这些挑战,使系统能够动态响应运行变化。然而现有多智能体系统仍面临实时适应、情境感知决策及资源能力动态探索等挑战。大语言模型凭借其情境感知决策能力,为突破这些限制提供了可能。本文提出一种基于大语言模型的多智能体制造系统控制架构,用于动态探索资源能力以应对实时干扰。基于仿真的案例研究表明,该架构提升了系统韧性与灵活性。研究发现,相较于现有方法,该架构实现了吞吐量提升与资源利用效率优化。
Profiling and optimization of multi-card GPU machine learning jobs
Abstract
arXiv:2505.22905v1 Announce Type: new Abstract: The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing, are analyzed. Selected optimization strategies are studied in detail, highlighting the related challenges and advantages of their implementation. Furthermore, the impact of different performance improvement techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of large language models is investigated. Experimental results illustrate how the nature of the task affects the iteration time in a multiprocessor environment, VRAM utilization, and overall memory transfers. Test scenarios are evaluated on the modern NVIDIA H100 GPU architecture.
摘要
机器学习方法的有效性和效率至关重要,尤其在结果质量与计算成本方面。本文探讨了不同的模型优化技术,对关键性能指标进行了全面分析。研究针对图像识别任务提出了多种并行化策略,这些策略适配不同的硬件与软件配置,包括分布式数据并行和分布式硬件处理方案。我们对选定的优化策略进行了详细研究,重点阐述了其实施过程中的相关挑战与优势。此外,本文还探究了不同性能提升技术(DPO、LoRA、QLoRA和QAT)对大型语言模型调优过程的影响。实验结果表明任务特性如何影响多处理器环境中的迭代时间、显存利用率以及整体内存传输量。所有测试场景均在现代NVIDIA H100 GPU架构上进行评估。
Be.FM: Open Foundation Models for Human Behavior
Abstract
arXiv:2505.23058v1 Announce Type: new Abstract: Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.
摘要
尽管基础模型在众多领域取得了成功,但其在人类行为建模与理解方面的潜力仍亟待探索。本文提出Be.FM——首个面向人类行为建模的开放基础模型之一。该模型基于开源大语言框架构建,并通过多源行为数据微调,能够用于理解和预测人类决策行为。我们建立了一套完整的基准测试任务体系以评估行为基础模型的性能。实验结果表明,Be.FM具备行为预测、个体与群体特征推断、情境洞察生成以及行为科学知识应用等能力。
Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Abstract
arXiv:2505.22756v1 Announce Type: new Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
摘要
数学推理任务已成为评估大语言模型(LLM)推理能力的重要基准,尤其是采用强化学习(RL)方法(如GRPO)后,模型性能显著提升。然而,仅凭准确率指标无法支持对能力的细粒度评估,也难以揭示模型内化了哪些问题解决技能。为深入理解这些能力,我们提出将问题解决分解为三项基础能力:规划(将问题映射为解题步骤序列)、执行(正确执行解题步骤)和验证(判断解答的正确性)。实证研究发现,GRPO主要增强了执行能力——提升模型对已知问题解的稳健性,这一现象我们称之为"温度蒸馏"。更重要的是,我们发现经过RL训练的模型在面对全新问题时表现欠佳,由于规划能力不足而遭遇"覆盖墙"。为深入探究RL的影响,我们构建了一个最小化的合成解树导航任务作为数学问题解决的类比。这一受控实验复现了实证结果,证实RL主要提升执行稳健性。值得注意的是,在此设定下,我们发现了RL可能通过改进探索能力和对新解路径的泛化来突破覆盖墙的条件。本研究揭示了RL在增强LLM推理中的作用,指出现有关键局限,并为突破这些障碍提供了路径。代码详见https://github.com/cfpark00/RL-Wall。
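The contrast the abstract draws between execution robustness and the "coverage wall" can be illustrated with two numbers computed from repeated sampling. The following is a minimal sketch, not the authors' evaluation code: it assumes `results[i][j]` records whether sample `j` on problem `i` was correct, defines accuracy as the mean over all samples, and defines coverage as the fraction of problems solved at least once. The function name and the toy data are hypothetical.

```python
from typing import List, Tuple

def accuracy_and_coverage(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (per-sample accuracy, fraction of problems solved at least once)."""
    total_samples = sum(len(row) for row in results)
    total_correct = sum(sum(row) for row in results)
    solved = sum(1 for row in results if any(row))
    return total_correct / total_samples, solved / len(results)

# Toy data: 4 problems, 4 samples each (hypothetical, for illustration only).
before = [
    [True, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
    [False, False, False, False],
]
# After training, already-solvable problems are solved more reliably,
# but no previously unsolved problem becomes solvable.
after = [
    [True, True, True, True],
    [True, True, True, False],
    [False, False, False, False],
    [False, False, False, False],
]

acc_before, cov_before = accuracy_and_coverage(before)
acc_after, cov_after = accuracy_and_coverage(after)
print(f"before: accuracy={acc_before:.4f}, coverage={cov_before:.2f}")
print(f"after:  accuracy={acc_after:.4f}, coverage={cov_after:.2f}")
```

In this toy setting accuracy rises while coverage stays flat, which is the pattern the abstract attributes to improved execution without improved planning.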
Design and testing of an agent chatbot supporting decision making with public transport data
Abstract
arXiv:2505.22698v1 Announce Type: new Abstract: Assessing the quality of public transportation services requires the analysis of large quantities of data on the scheduled and actual trips and documents listing the quality constraints each service needs to meet. Interrogating such datasets with SQL queries, organizing and visualizing the data can be quite complex for most users. This paper presents a chatbot offering a user-friendly tool to interact with these datasets and support decision making. It is based on an agent architecture, which expands the capabilities of the core Large Language Model (LLM) by allowing it to interact with a series of tools that can execute several tasks, like performing SQL queries, plotting data and creating maps from the coordinates of a trip and its stops. This paper also tackles one of the main open problems of such Generative AI projects: collecting data to measure the system's performance. Our chatbot has been extensively tested with a workflow that asks several questions and stores the generated query, the retrieved data and the natural language response for each of them. Such questions are drawn from a set of base examples which are then completed with actual data from the database. This procedure yields a dataset for the evaluation of the chatbot's performance, especially the consistency of its answers and the correctness of the generated queries.
摘要
评估公共交通服务质量需要分析大量关于计划行程与实际行程的数据,以及列明各项服务需满足质量约束的文档。对于大多数用户而言,使用SQL查询来检索此类数据集、组织并可视化数据可能相当复杂。本文提出一种聊天机器人,为用户提供与这些数据集交互的友好工具以支持决策。该工具基于智能体架构,通过让核心大语言模型(LLM)与一系列工具交互来扩展其功能,这些工具可执行多项任务,如运行SQL查询、绘制数据图表,以及根据行程及其站点的坐标生成地图。本文还解决了此类生成式人工智能项目的主要开放性问题之一:收集数据以衡量系统性能。我们通过特定工作流程对聊天机器人进行了全面测试,该流程会提出若干问题,并存储每个问题对应的生成查询、检索数据及自然语言回答。这些问题源自一组基础案例,随后通过数据库中的实际数据进行补充。此流程生成了一个用于评估聊天机器人性能的数据集,特别是其回答的一致性和生成查询的正确性。
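The evaluation workflow described above, in which base example questions are completed with actual data from the database, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the paper's implementation: the `trips` schema, the template, and `build_eval_records` are all hypothetical, and the chatbot itself is not modelled. Only the template-completion step and the computation of a reference answer are shown.

```python
import sqlite3

# In-memory database with a hypothetical schema (not the paper's).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (line TEXT, scheduled_min INTEGER, actual_min INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("L1", 30, 34), ("L1", 30, 31), ("L2", 45, 45)],
)

# A base example: a question template paired with a reference query.
TEMPLATE = {
    "question": "What is the average delay in minutes on line {line}?",
    "sql": "SELECT AVG(actual_min - scheduled_min) FROM trips WHERE line = ?",
}

def build_eval_records(conn, template):
    """Fill the template with line names taken from the database itself."""
    lines = [row[0] for row in conn.execute("SELECT DISTINCT line FROM trips ORDER BY line")]
    records = []
    for line in lines:
        (expected,) = conn.execute(template["sql"], (line,)).fetchone()
        records.append({
            "question": template["question"].format(line=line),
            "reference_sql": template["sql"],
            "params": (line,),
            "expected": expected,
        })
    return records

records = build_eval_records(conn, TEMPLATE)
for r in records:
    print(r["question"], "->", r["expected"])
```

Each record could then be paired with the query, data, and natural language response that the chatbot produces for the same question, which is what makes consistency and query correctness measurable.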
TailorSQL: An NL2SQL System Tailored to Your Query Workload
Abstract
arXiv:2505.23039v1 Announce Type: new Abstract: NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users and serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity that is present in real-world settings: NL2SQL is typically applied to existing databases that have already served many SQL queries in the past. The past query workload implicitly contains information that is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2× improvement in execution accuracy on standardized benchmarks.
摘要
NL2SQL(自然语言转SQL)技术将自然语言问题转化为SQL查询语句,使非技术用户能够访问结构化数据,成为智能数据应用的基础。最先进的NL2SQL技术通常通过检索数据库特定信息(如数据库模式)并调用预训练的大语言模型(LLM),结合问题与检索信息生成SQL查询。然而,现有NL2SQL技术忽略了实际场景中的关键机会:NL2SQL通常应用于已有历史SQL查询记录的现有数据库。这些过往查询负载隐含了有助于提升NL2SQL翻译准确性的信息(如常见连接路径及命名晦涩的表/列语义),这些信息无法仅通过数据库模式获取。我们提出TailorSQL系统,利用历史查询负载中的信息来同时提高自然语言转SQL的准确性与响应速度。通过针对特定工作负载进行优化,TailorSQL在标准测试基准上实现了最高2倍的执行准确率提升。
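To make the idea of "common join paths" in a past query workload concrete, the sketch below counts how often pairs of qualified columns are equated across a log of queries. This is a deliberately simple illustration and not TailorSQL's method: it assumes queries use full table names rather than aliases, relies on a regular expression instead of a SQL parser, and the function name and sample queries are hypothetical.

```python
import re
from collections import Counter

# Matches "table.col = table.col" equality predicates between qualified columns.
JOIN_PREDICATE = re.compile(
    r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*)\.([A-Za-z_]\w*)"
)

def mine_join_paths(queries):
    """Count how often each pair of qualified columns is equated in past queries."""
    counts = Counter()
    for sql in queries:
        for t1, c1, t2, c2 in JOIN_PREDICATE.findall(sql):
            if t1 == t2:
                continue  # same-table comparison, not a join
            # Sort so that "a.x = b.y" and "b.y = a.x" count as the same path.
            left, right = sorted([f"{t1}.{c1}", f"{t2}.{c2}"])
            counts[(left, right)] += 1
    return counts

past_queries = [
    "SELECT * FROM orders JOIN cust ON orders.cid = cust.id",
    "SELECT cust.name FROM cust JOIN orders ON cust.id = orders.cid WHERE orders.total > 10",
    "SELECT * FROM orders JOIN items ON orders.id = items.oid",
]

paths = mine_join_paths(past_queries)
hints = [f"{a} joins {b} (seen {n}x)" for (a, b), n in paths.most_common()]
print("\n".join(hints))
```

Hints of this kind could be added to the prompt alongside the schema. A production system would need a real SQL parser to resolve aliases and subqueries.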
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
Abstract
arXiv:2505.22960v1 Announce Type: new Abstract: The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
摘要
大语言模型(LLM)能力的显著增长推动了对多智能体系统的探索,其中辩论框架作为一种增强问题解决能力的有前景途径而崭露头角。这些多智能体辩论(MAD)方法通过智能体协作提出、批判和完善论点,相较于单一模型可能提供更优的推理能力、鲁棒性及多元视角。尽管已有研究利用MAD方法,但其与单智能体方法相比的有效性,尤其是在不同条件下的系统性理解仍不明确。本文通过将MAD概念化为一种测试时计算扩展技术(其特点在于协作优化和多样化探索能力),试图填补这一空白。我们在数学推理和安全性相关任务上,对MAD与强基线单智能体测试时扩展方法进行了全面实证比较。研究系统考察了任务难度、模型规模和智能体多样性对MAD性能的影响。关键发现表明:在数学推理任务中,MAD相较单智能体扩展优势有限,但随着问题难度增加或模型能力下降,其有效性会提升,而智能体多样性则未显现明显益处;相反在安全任务中,MAD的协作优化可能增加系统脆弱性,但通过引入多样化智能体配置,协作优化过程可逐步降低攻击成功率。我们相信这些发现为未来开发更有效且具备战略部署价值的MAD系统提供了关键指导。
Speeding up Model Loading with fastsafetensors
Abstract
arXiv:2505.23072v1 Announce Type: new Abstract: The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilizes storage throughput and significantly slows down the loading of large models stored in a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).
摘要
模型参数规模的快速增长给预训练模型加载带来了新的挑战。当前机器学习代码通常先将每个参数作为张量对象在主机内存中反序列化,再将其复制到设备内存。我们发现这种方法未能充分利用存储吞吐量,并显著降低了使用广泛采用的模型文件格式safetensors加载大型模型的速度。本研究提出fastsafetensors——一个专为优化safetensors文件中张量反序列化而设计的Python库。我们的方法首先将磁盘上的参数组直接复制到设备内存,并在其中实例化为张量对象。该设计支持在底层I/O和高层张量预处理环节实施进一步优化,包括并行复制、点对点直接内存访问以及GPU卸载等。实验结果表明,在加载Llama(70亿、130亿和700亿参数)、Falcon(400亿参数)和Bloom(1760亿参数)等模型时,性能提升了4.8至7.5倍。
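Copying groups of parameters in bulk and instantiating tensors afterwards is possible because of how a safetensors file is laid out: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and byte offsets, and then one contiguous data buffer. The sketch below builds and parses such a layout using only the Python standard library. It does not use the fastsafetensors API or any GPU. The helper names are hypothetical, and host-memory byte slicing stands in for what would be views over a device buffer.

```python
import json
import struct

def build_safetensors_bytes(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} in the safetensors layout."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian unsigned header length, then JSON, then raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_header(blob):
    """Return (header dict, offset at which the raw tensor bytes start)."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len

def tensor_bytes(blob, header, data_start, name):
    """Slice one tensor's raw bytes out of the single contiguous buffer."""
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]

blob = build_safetensors_bytes({
    "w": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
    "b": ("F32", [1], struct.pack("<f", 0.5)),
})
header, data_start = read_header(blob)
w = struct.unpack("<2f", tensor_bytes(blob, header, data_start, "w"))
b = struct.unpack("<f", tensor_bytes(blob, header, data_start, "b"))
print(header["w"]["data_offsets"], w, b)
```

Because every tensor is just an offset range into one buffer, the whole data region can be transferred in a few large reads and the per-tensor objects created afterwards, rather than deserializing parameters one at a time.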
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
Abstract
arXiv:2505.23219v1 Announce Type: new Abstract: In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, as the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach is used to determine the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah can achieve up to 7.6x speedup in the dominant LLM decoding phase compared to the sequential decoding approach on NVIDIA Jetson NX.
摘要
由于隐私优势及降低对外部基础设施的依赖,在终端用户设备上进行大语言模型(LLM)原位推理受到广泛关注。然而,由于解码过程受内存带宽限制,现代终端设备中的多样化处理单元无法被充分利用,导致LLM推理速度缓慢。本文提出Ghidorah——一种面向统一内存架构终端设备的LLM推理系统,其核心思想可归纳为两步:1)利用推测式解码方法增强并行性;2)通过智能分配异构处理单元的工作负载实现计算资源最大化利用。Ghidorah包含异构核心模型并行(HCMP)架构与架构感知性能分析(ARCA)方法:HCMP架构依托终端设备的统一内存设计进行任务划分,并适配推测式解码的混合计算需求;ARCA方法用于确定最优推测策略与划分策略,通过平衡接受率与并行能力实现加速比最大化。此外,我们还优化了ARM CPU的稀疏计算。实验表明,在NVIDIA Jetson NX平台上,Ghidorah相较于顺序解码方法在LLM主导的解码阶段可实现最高7.6倍的加速比。
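The first of Ghidorah's two steps, speculative decoding, can be sketched with toy stand-ins for the two models. The code below is a generic greedy draft-and-verify loop and not Ghidorah's implementation. `target_next` and `draft_next` are hypothetical deterministic functions, the HCMP partitioning and ARCA profiling are not modelled, and verification is done token by token, whereas a real system would batch it into one parallel forward pass.

```python
def target_next(prefix):
    """Toy 'large' model: next token is (sum of prefix) mod 5."""
    return sum(prefix) % 5

def draft_next(prefix):
    """Toy 'small' model: agrees with the target unless the last token is 3."""
    return 0 if prefix[-1] == 3 else sum(prefix) % 5

def speculative_step(prefix, k):
    """Draft k tokens, verify them against the target, return accepted tokens."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all drafts match
    return accepted

def generate(prefix, n_new, k):
    """Run speculative steps until n_new tokens exist; also count the steps."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, k))
        steps += 1
    return out[:len(prefix) + n_new], steps

def sequential(prefix, n_new):
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

spec_out, spec_steps = generate([1, 2], n_new=8, k=3)
print(spec_out == sequential([1, 2], 8), spec_steps)
```

The output matches plain sequential decoding, but fewer verification rounds are needed than tokens produced. The balance between acceptance rate and draft length `k` is the kind of trade-off the abstract says ARCA is used to tune.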
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Abstract
arXiv:2505.23091v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).
摘要
大型语言模型(LLMs)的最新进展在推理能力方面取得了显著进步,例如DeepSeek-R1通过基于规则的强化学习显著提升了逻辑推理能力。然而,将这些成果扩展到多模态大型语言模型(MLLMs)面临关键挑战,对于基础推理能力通常较弱的多模态小型语言模型(MSLMs)而言,这些问题尤为突出:(1)高质量多模态推理数据集的稀缺性;(2)视觉处理整合导致的推理能力退化;(3)直接应用强化学习可能产生复杂但错误的推理过程的风险。为解决这些挑战,我们设计了新颖框架Infi-MMR,通过三个精心设计的阶段性课程系统化释放MSLMs的推理潜力,并提出了多模态推理模型Infi-MMR-3B。第一阶段"基础推理激活"利用高质量文本推理数据集激活并增强模型的逻辑推理能力。第二阶段"跨模态推理适应"通过标题增强的多模态数据促进推理技能向多模态场景的渐进迁移。第三阶段"多模态推理强化"采用精选的无标题多模态数据以减少语言偏见并促进稳健的跨模态推理。Infi-MMR-3B在多模态数学推理能力(MathVerse testmini 43.68%、MathVision test 27.04%、OlympiadBench 21.33%)和通用推理能力(MathVista testmini 67.2%)上均达到最先进水平。
MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
Abstract
arXiv:2505.23254v1 Announce Type: new Abstract: Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations with limited hardware resources. Although SSD offloading (e.g., ZeRO-Infinity) has emerged as a viable strategy to overcome the GPU memory barrier by leveraging both system memory (i.e., CPU DRAM) and storage space (i.e., solid-state drives, SSDs), its design primarily targets model-centric performance issues. As a result, key system-level issues, including system memory fragmentation, inefficient pinned buffer allocation, peak CPU usage spikes, and file system overhead, remain unaddressed, stifling scalability and inflating costs. This observation motivates MemAscend, a framework that systematically tackles the underexplored system memory bottlenecks in SSD-offloaded LLM training, with a focus on resource-constrained environments. By streamlining pinned-memory allocation, eradicating fragmentation, and mitigating peak overhead, MemAscend reclaims a substantial system memory budget, enabling larger models, longer context windows, and higher batch sizes without exceeding modest hardware limits. Across diverse LLM benchmarks, MemAscend reduces peak system-memory consumption by an average of 55.7% compared with standard SSD offloading techniques, lowering the hardware barrier for fine-tuning and unlocking new possibilities for cost-effective large-scale training on limited-resource machines.
摘要
由于生成式人工智能(AI)的巨大成功,大语言模型(LLM)已成为核心子类,支撑着问答、文本生成和代码补全等应用。虽然在领域特定数据上微调这些模型能显著提升性能,但也带来了巨大的计算挑战,尤其对硬件资源有限的研究者和小型机构而言。尽管SSD卸载技术(如ZeRO-Infinity)通过利用系统内存(CPU DRAM)和存储空间(固态硬盘)成为突破GPU内存壁垒的可行策略,但其设计主要针对模型中心的性能问题。这导致关键系统级问题——包括系统内存碎片、低效的固定缓冲区分配、CPU使用率峰值激增及文件系统开销——仍未解决,从而抑制了可扩展性并推高了成本。这一观察促使本文提出MemAscend框架,该系统化解决SSD卸载LLM训练中尚未充分探索的系统内存瓶颈,尤其关注资源受限环境。通过优化固定内存分配、消除碎片并降低峰值开销,MemAscend回收了大量系统内存预算,使得在有限硬件条件下能训练更大模型、支持更长上下文窗口和更高批量大小。在多样化LLM基准测试中,相比标准SSD卸载技术,MemAscend平均降低55.7%的系统内存峰值消耗,降低了微调的硬件门槛,为资源受限机器上经济高效的大规模训练开辟了新可能。
Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble
Abstract
arXiv:2505.23075v1 Announce Type: new Abstract: Despite the growing clinical adoption of large language models (LLMs), current approaches heavily rely on single model architectures. To overcome risks of obsolescence and rigid dependence on single model systems, we present a novel framework, termed the Consensus Mechanism. Mimicking clinical triage and multidisciplinary clinical decision-making, the Consensus Mechanism implements an ensemble of specialized medical expert agents enabling improved clinical decision making while maintaining robust adaptability. This architecture enables the Consensus Mechanism to be optimized for cost, latency, or performance, purely based on its interior model configuration. To rigorously evaluate the Consensus Mechanism, we employed three medical evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI's O3 and Google's Gemini 2.5 Pro, respectively. Improvement was consistent across benchmarks, with an increase in accuracy over O3 on MedQA (3.4%) and MedMCQA (9.1%). These accuracy gains extended to differential diagnosis generation, where our system demonstrated improved recall and precision (F1 of 0.326 for the Consensus Mechanism vs. 0.2886 for O3-high) and a higher top-1 accuracy for DDX (52.0% for the Consensus Mechanism vs. 45.2% for O3-high).
摘要
尽管大型语言模型(LLMs)在临床中的应用日益广泛,但当前方法严重依赖单一模型架构。为克服模型过时和单一系统刚性依赖的风险,我们提出了一种称为"共识机制"的新框架。该机制通过模拟临床分诊和多学科临床决策流程,集成多个专业医学专家代理,在提升临床决策质量的同时保持强大的适应性。此架构使共识机制能够根据内部模型配置,灵活优化成本、延迟或性能表现。
为严格评估共识机制,我们采用三个医学评估基准(MedMCQA、MedQA和MedXpertQA Text)及鉴别诊断数据集DDX+。在MedXpertQA上,共识机制准确率达61.0%,优于OpenAI O3(53.5%)和Google Gemini 2.5 Pro(45.9%)。各基准测试均显示性能提升:相较于O3,MedQA准确率提升3.4%,MedMCQA提升9.1%。该优势延伸至鉴别诊断生成领域,本系统在召回率与精确度(F1:共识机制0.326,O3-high为0.2886)及DDX top-1准确率(共识机制52.0%,O3-high为45.2%)上均表现出改进。
Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems
Abstract
arXiv:2505.23352v1 Announce Type: new Abstract: The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies on the automated design of communication topologies tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB-learner.
摘要
基于大语言模型的多智能体系统中,通信拓扑结构从根本上决定了智能体间的协作模式,对集体决策的效率和效果具有关键性影响。尽管近期关于通信拓扑自动设计的研究倾向于构建稀疏结构以提升效率,但这些研究往往忽视了稀疏与密集拓扑在何种情况下会促进或阻碍协作的内在机制。本文提出一个因果分析框架,用以研究智能体输出(无论正确或错误)在不同稀疏度拓扑下的传播规律。实证研究表明,适度稀疏的拓扑能有效抑制错误传播同时保留有益信息扩散,通常可实现最优任务性能。基于此发现,我们提出一种新型拓扑设计方法EIB-learner,通过融合密集图与稀疏图的连接模式,实现错误抑制与有益信息传播的平衡。大量实验证明EIB-learner在效能、通信成本和鲁棒性方面均具有显著优势。
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Abstract
arXiv:2505.23281v1 Announce Type: new Abstract: The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as SMT 2025 -- published well after model release dates -- demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On USAMO 2025, even top models score below 25%, far behind their performance on final-answer tasks. So far, we have evaluated 30 models across five competitions, totaling 149 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
摘要
大型语言模型(LLMs)推理能力的快速提升显著改善了数学基准测试表现。然而,当前最常用的评估数据集(如AIME 2024)大多可在线公开获取,导致难以区分模型是真正具备推理能力还是存在记忆现象。此外,这些基准未能评估对数学任务至关重要的证明撰写能力。为此,我们提出MathArena基准,其核心思路在于:周期性数学竞赛能持续提供高质量、高难度题目,可用于LLMs的实时评估。通过在新题发布后立即测试模型,我们有效消除了数据污染风险。基于该框架,我们在AIME 2024中发现了明显的数据污染迹象。而在模型发布后新推出的高难度竞赛(如SMT 2025)上,顶尖模型仍展现出卓越的推理能力。MathArena也是首个评估证明撰写能力的基准——在USAMO 2025测试中,即使最优模型得分也不足25%,远低于其最终答案类任务的表现。目前我们已评估30个模型在5项竞赛中的149道题目。作为动态基准,MathArena将持续追踪LLMs在新发布竞赛中的表现,确保对数学推理能力进行严格且与时俱进的评估。
Autoformalization in the Era of Large Language Models: A Survey
Abstract
arXiv:2505.23486v1 Announce Type: new Abstract: Autoformalization, the process of transforming informal mathematical propositions into verifiable formal representations, is a foundational task in automated theorem proving, offering a new perspective on the use of mathematics in both theoretical and applied domains. Driven by the rapid progress in artificial intelligence, particularly large language models (LLMs), this field has witnessed substantial growth, bringing both new opportunities and unique challenges. In this survey, we provide a comprehensive overview of recent advances in autoformalization from both mathematical and LLM-centric perspectives. We examine how autoformalization is applied across various mathematical domains and levels of difficulty, and analyze the end-to-end workflow from data preprocessing to model design and evaluation. We further explore the emerging role of autoformalization in enhancing the verifiability of LLM-generated outputs, highlighting its potential to improve both the trustworthiness and reasoning capabilities of LLMs. Finally, we summarize key open-source models and datasets supporting current research, and discuss open challenges and promising future directions for the field.
摘要
自动形式化是将非形式化的数学命题转化为可验证的形式化表示的过程,作为自动化定理证明的基础性任务,其为理论及应用领域中数学运用的研究提供了新视角。在人工智能尤其是大语言模型(LLMs)快速发展的推动下,该领域取得了显著进展,既带来新机遇也面临独特挑战。本综述从数学与LLM核心视角出发,全面梳理了自动形式化的最新进展。我们考察了该方法在不同数学领域及难度层级中的应用情况,系统分析了从数据预处理到模型设计与评估的端到端工作流程。进一步探讨了自动形式化在增强LLM生成内容可验证性方面的新兴作用,揭示了其在提升LLMs可信度与推理能力方面的潜力。最后,我们总结了支撑当前研究的关键开源模型与数据集,并就该领域的开放挑战与未来发展方向进行了讨论。
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Abstract
arXiv:2505.23416v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
摘要
基于Transformer架构 的大语言模型(LLM)在推理过程中会将上下文缓存为键值(KV)对。随着上下文长度增加,KV缓存规模扩大,导致显著的内存开销和注意力延迟上升。本文提出KVzip方法,这是一种与查询无关的KV缓存淘汰机制,能够实现对压缩后KV缓存在不同查询间的有效复用。KVzip通过底层LLM量化KV对的重要性,利用缓存的KV对重构原始上下文,进而淘汰重要性较低的键值对。大量实验评估表明,KVzip可将KV缓存体积缩小3-4倍,使FlashAttention解码延迟降低约2倍,且在问答、检索、推理和代码理解等任务中性能损失可忽略不计。评估涵盖LLaMA3.1-8B、Qwen2.5-14B和Gemma3-12B等多种模型,上下文长度最高达17万token。KVzip显著优于现有查询感知型KV淘汰方法——后者在多查询场景下即使保留90%缓存预算仍会出现性能下降。
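The eviction step itself is simple once each cached pair has an importance score. The sketch below keeps the highest-scoring fraction of pairs and preserves their original positions. It is an illustration only: KVzip derives its scores by having the LLM reconstruct the context, whereas here the scores are assumed to be given, and `evict_kv` and the toy data are hypothetical.

```python
def evict_kv(keys, values, scores, keep_ratio):
    """Keep the highest-scoring fraction of KV pairs, preserving position order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank positions by importance, then restore the original order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache of 8 positions; the scores are assumed inputs, not computed by a model.
keys = [f"k{i}" for i in range(8)]
values = [f"v{i}" for i in range(8)]
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.01, 0.3, 0.02]

small_keys, small_values, kept = evict_kv(keys, values, scores, keep_ratio=0.25)
print(kept, small_keys, small_values)
```

Because the scores do not depend on any particular query, the compressed cache can be computed once per context and reused for every later query, which is the query-agnostic property the abstract emphasizes.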
Case-Based Reasoning Enhances the Predictive Power of LLMs in Drug-Drug Interaction
Abstract
arXiv:2505.23034v1 Announce Type: new Abstract: Drug-drug interaction (DDI) prediction is critical for treatment safety. While large language models (LLMs) show promise in pharmaceutical tasks, their effectiveness in DDI prediction remains limited. Inspired by the well-established clinical practice where physicians routinely reference similar historical cases to guide their decisions through case-based reasoning (CBR), we propose CBR-DDI, a novel framework that distills pharmacological principles from historical cases to improve LLM reasoning for DDI tasks. CBR-DDI constructs a knowledge repository by leveraging LLMs to extract pharmacological insights and graph neural networks (GNNs) to model drug associations. A hybrid retrieval mechanism and dual-layer knowledge-enhanced prompting allow LLMs to effectively retrieve and reuse relevant cases. We further introduce a representative sampling strategy for dynamic case refinement. Extensive experiments demonstrate that CBR-DDI achieves state-of-the-art performance, with a significant 28.7% accuracy improvement over both popular LLMs and the CBR baseline, while maintaining high interpretability and flexibility.
摘要
药物相互作用(DDI)预测对治疗安全性至关重要。尽管大语言模型(LLM)在药学任务中展现出潜力,但其在DDI预测中的有效性仍面临挑战。受临床实践中医生常规参考类似历史病例通过案例推理(CBR)指导决策的成熟方法启发,我们提出CBR-DDI框架——该框架通过从历史案例中提炼药理学原理来增强LLM在DDI任务中的推理能力。CBR-DDI构建知识库时采用LLM提取药理学洞见,并利用图神经网络(GNN)建模药物关联关系。通过混合检索机制和双层知识增强提示策略,LLM能够有效检索并复用相关案例。我们还引入动态案例优化的代表性采样策略。大量实验表明,CBR-DDI实现了最先进的性能,相较于主流LLM和CBR基线模型准确率显著提升28.7%,同时保持高解释性与灵活性。
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Abstract
arXiv:2505.23474v1 Announce Type: new Abstract: Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
摘要
过程奖励模型(PRMs)在复杂推理和问题解决任务(如具有长期决策能力的大语言模型智能体)中具有关键作用,其通过验证每个中间推理步骤的正确性来实现功能。在现实场景中,大语言模型可能采用多种推理模式(如分解)来解决问题,但不同推理模式下均可能出现错误。因此,过程奖励模型需要能够在推理过程中识别各种推理模式下的错误。然而,现有基准主要关注基于逐步正确性的PRMs评估,缺乏对不同推理模式下PRMs的系统性评估。为弥补这一不足,我们提出Socratic-PRMBench——一个在六种推理模式(包括转换、分解、重组、演绎、验证和整合)下系统评估PRMs的新基准。该基准包含2995条在上述六种推理模式中存在缺陷的推理路径。通过对PRMs及作为批评模型的大语言模型进行实验,我们发现现有PRMs存在显著不足。这些发现揭示了当前PRMs在多种推理模式下评估推理步骤时的重大缺陷。我们希望Socratic-PRMBench能成为系统评估不同推理模式下PRMs的综合测试平台,并为PRMs的未来发展奠定基础。
Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters
Abstract
arXiv:2505.23554v1 Announce Type: new Abstract: In recent years, Large Language Models (LLMs) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. As LLMs are queried incessantly, the cumulative carbon footprint for the operational phase has been shown to far exceed the footprint during the training phase. Further, estimates indicate that 500 ml of fresh water is expended for every 20-50 requests to LLMs during inference. To address these important sustainability issues with LLMs, we propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first-token), carbon emissions, water usage, and energy costs. The framework utilizes a machine learning (ML) based metaheuristic to enhance the sustainability of LLM hosting across geo-distributed cloud datacenters. Such a framework will become increasingly vital as LLMs proliferate.
摘要
近年来,诸如ChatGPT、CoPilot和Gemini等大型语言模型(LLM)已在多个领域得到广泛应用。随着LLM使用量的持续增长,许多研究致力于降低这些模型的庞大训练开销。然而,处理用户对LLM请求所产生的环境影响正日益引发关注。最新研究表明,LLM在推理阶段的年运营成本可能超过训练成本的25倍。由于LLM持续接收查询,其运营阶段的累计碳足迹已远超训练阶段。此外,估算显示LLM推理阶段每处理20-50次请求就会消耗500毫升淡水。为解决LLM这些重要的可持续性问题,我们提出名为SLIT的新型框架,协同优化LLM服务质量(首令牌响应时间)、碳排放、水资源消耗及能源成本。该框架采用基于机器学习(ML)的元启发式算法,以提升跨地域分布式云数据中心托管LLM的可持续性。随着LLM的普及,此类框架将变得愈发关键。
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
Abstract
arXiv:2505.23473v1 Announce Type: new Abstract: Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize the evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with 140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned with supervision on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.
摘要
大语言模型(LLMs)经常拒绝响应伪恶意指令:这些语义无害的输入查询因保守的安全对齐策略触发了不必要的模型拒绝,显著损害了用户体验。收集此类指令对于评估和缓解过度拒绝行为至关重要,但现有的指令构建方法(如人工创建或指令改写)要么缺乏可扩展性,要么无法生成足够多样且有效的诱发拒绝提示。为解决这些局限性,我们提出了EVOREFUSE——一种提示优化方法,能生成多样化的伪恶意指令,持续在不同LLMs中引发高置信度的拒绝。EVOREFUSE采用进化算法,通过突变策略和重组在指令空间中进行比现有方法更多样化的探索,并迭代优化种子指令以最大化模型拒绝概率的证据下界。基于EVOREFUSE,我们创建了两个新数据集:EVOREFUSE-TEST(包含582条伪恶意指令的基准测试集,在9个LLMs上的平均拒绝触发率比次优基准高140.41%,词汇多样性提升34.86%,模型响应置信度得分提高40.03%)和EVOREFUSE-ALIGN(提供3,000条带标注响应的伪恶意指令,用于监督式和基于偏好的对齐训练)。在EVOREFUSE-ALIGN上监督微调的LLAMA3.1-8B-INSTRUCT模型,其过度拒绝现象比使用次优对齐数据集训练的模型减少达14.31%,且不影响安全性。我们通过EVOREFUSE-TEST分析发现,模型过度拒绝的机制源于对敏感关键词的过度关注而忽略整体上下文。
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Abstract
arXiv:2505.23575v1 Announce Type: new Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
摘要
随着人工智能模型自主性日益增强,确保其不会在无人察觉的情况下采取有害行为变得至关重要。作为一种潜在解决方案,我们研究了思维链(CoT)监控方法——该方法通过一个较弱但可信的监控模型持续监督更强大但不可信模型的中间推理步骤。在红队测试环境中(不可信模型被指示在完成编码任务时执行有害的次要任务),我们将CoT监控与仅审查最终输出的行为监控进行比较。研究发现,在行为监控无法可靠识别破坏行为的场景中,CoT监控可将检测率提升高达27个百分点。然而,思维链轨迹也可能包含误导性合理化解释以欺骗监控器,导致在更明显的破坏案例中性能下降。为此,我们提出一种混合协议:独立评分推理过程与最终输出,并通过加权平均进行综合。这种混合监控器在所有测试模型和任务中均持续优于纯CoT和行为监控,对于隐蔽欺骗场景的检测率比纯行为监控高出四倍以上。
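The hybrid protocol in the CoT monitoring abstract above scores the reasoning trace and the final output independently, then combines the two with a weighted average. A minimal sketch of that combination step follows. The function names, the `[0, 1]` score range, the default weight, and the flagging threshold are all illustrative assumptions; the abstract does not state how the scores are scaled or which weights are used.

```python
def hybrid_suspicion(cot_score: float, action_score: float,
                     cot_weight: float = 0.5) -> float:
    """Weighted average of two independently produced suspicion scores.

    Both scores are assumed (not stated in the abstract) to lie in [0, 1],
    with higher values meaning "more likely sabotage".
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(cot_score: float, action_score: float,
               cot_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a transcript when the combined score reaches a hypothetical threshold."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold
```

Setting `cot_weight` to 1.0 or 0.0 recovers a pure CoT monitor or a pure action-only monitor, which is one way to compare the three setups under a single interface.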
TRAP: Targeted Redirecting of Agentic Preferences
Abstract
arXiv:2505.23518v1 Announce Type: new Abstract: Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploit semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a generative adversarial framework that manipulates the agent's decision-making using diffusion-based semantic injections. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images yet induces consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP achieves a 100% attack success rate on leading models, including LLaVA-34B, Gemma3, and Mistral-3.1, significantly outperforming baselines such as SPSA, Bandit, and standard diffusion approaches. These results expose a critical vulnerability: Autonomous agents can be consistently misled through human-imperceptible cross-modal manipulations. These findings highlight the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making.
摘要
基于视觉语言模型(VLMs)的自主智能AI系统正迅速向现实世界部署迈进,但其跨模态推理能力引入了新的对抗性操纵攻击面,这些攻击利用跨模态的语义推理。现有的对抗攻击通常依赖于可见像素扰动或需要特权模型或环境访问权限,这使得它们在隐蔽的实际应用中不切实际。我们提出了TRAP,一种生成对抗框架,通过基于扩散的语义注入来操纵智能体的决策。我们的方法结合了基于负面提示的退化与正面语义优化,并由连体语义网络和布局感知的空间掩码指导。TRAP无需访问模型内部,即可生成视觉上自然的图像,同时在自主AI系统中诱导出一致的选择偏差。我们在微软通用上下文对象(COCO)数据集上评估TRAP,构建多候选决策场景。在这些场景中,TRAP在包括LLaVA-34B、Gemma3和Mistral-3.1在内的领先模型上实现了100%的攻击成功率,显著优于SPSA、Bandit和标准扩散方法等基线。这些结果揭示了一个关键漏洞:自主智能体可以通过人类难以察觉的跨模态操纵被持续误导。这些发现强调,需要超越像素级鲁棒性的防御策略,以应对跨模态决策中的语义漏洞。
SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents
Abstract
arXiv:2505.23559v1 Announce Type: new Abstract: Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce SafeScientist, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose SciSafetyBench, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab-uiuc/SafeScientist. Warning: this paper contains example data that may be offensive or harmful.
摘要
尽管大语言模型(LLM)智能体的最新进展显著加速了科学发现的自动化进程,但同时也引发了关键的伦理与安全问题。为系统应对这些挑战,我们提出SafeScientist——一个创新性AI科学家框架,其核心设计目标在于增强AI驱动科学探索的安全性与伦理责任。该框架能主动拒绝伦理失当或高风险任务,并在研究全流程严格贯彻安全准则。为实现全面安全监管,我们整合了多重防御机制,包括提示监控、智能体协作监控、工具使用监控以及伦理审查组件。作为配套评估体系,我们提出SciSafetyBench基准测试,专门用于评估科学场景下的AI安全性,包含6大领域的240项高风险科学任务、30种特制科学工具及120项工具相关风险任务。大量实验表明,与传统AI科学家框架相比,SafeScientist在保持科研成果质量的同时,将安全性能提升35%。此外,我们针对多种对抗攻击方法严格验证了安全管道的鲁棒性,进一步证实了集成方案的有效性。代码与数据将在https://github.com/ulab-uiuc/SafeScientist公开。警告:本文包含可能具有冒犯性或危害性的示例数据。
Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
Abstract
arXiv:2505.23667v1 Announce Type: new Abstract: Tables are a fundamental structure for organizing and analyzing data, making effective table understanding a critical capability for intelligent systems. While large language models (LMs) demonstrate strong general reasoning abilities, they continue to struggle with accurate numerical or symbolic reasoning over tabular data, especially in complex scenarios. Spreadsheet formulas provide a powerful and expressive medium for representing executable symbolic operations, encoding rich reasoning patterns that remain largely underutilized. In this paper, we propose Formula Tuning (Fortune), a reinforcement learning (RL) framework that trains LMs to generate executable spreadsheet formulas for question answering over general tabular data. Formula Tuning reduces the reliance on supervised formula annotations by using binary answer correctness as a reward signal, guiding the model to learn formula derivation through reasoning. We provide a theoretical analysis of its advantages and demonstrate its effectiveness through extensive experiments on seven table reasoning benchmarks. Formula Tuning substantially enhances LM performance, particularly on multi-step numerical and symbolic reasoning tasks, enabling a 7B model to outperform O1 on table understanding. This highlights the potential of formula-driven RL to advance symbolic table reasoning in LMs.
摘要
表格是组织与分析数据的基础结构,使有效的表格理解成为智能系统的关键能力。尽管大语言模型(LMs)展现出强大的通用推理能力,但在处理复杂场景下表格数据的精确数值或符号推理时仍存在困难。电子表格公式作为一种强大且富有表达力的媒介,能够表征可执行的符号操作,其编码的丰富推理模式目前尚未得到充分利用。本文提出公式调优(Formula Tuning,简称Fortune)——一种基于强化学习(RL)的框架,通过训练LMs生成可执行的电子表格公式来实现通用表格数据的问答。该方法利用二元答案正确性作为奖励信号,减少了对监督式公式标注的依赖,引导模型通过推理学习公式推导过程。我们通过理论分析阐明了其优势,并在七个表格推理基准测试中通过大量实验验证了其有效性。公式调优显著提升了LMs的性能,尤其在多步数值与符号推理任务上,使一个70亿参数的模型在表格理解任务中超越O1模型。这一成果凸显了公式驱动强化学习在推进LMs符号化表格推理方面的潜力。
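The Fortune abstract above describes using binary answer correctness as the reward: the model emits a spreadsheet formula, the formula is executed, and the reward depends only on whether the result matches the reference answer. The sketch below illustrates that idea under heavy simplification. The toy executor `execute_sum` handles only single-column `=SUM(...)` ranges and stands in for a real spreadsheet engine; it, the cell-dictionary table format, and the answer normalization are assumptions made for illustration, not details from the paper.

```python
import re


def execute_sum(formula: str, cells: dict) -> float:
    """Toy executor (hypothetical): supports only '=SUM(A1:A3)'-style ranges."""
    match = re.fullmatch(r"=SUM\(([A-Z])(\d+):([A-Z])(\d+)\)", formula.strip())
    if match is None or match.group(1) != match.group(3):
        raise ValueError("unsupported formula")
    col, start, end = match.group(1), int(match.group(2)), int(match.group(4))
    return sum(cells[f"{col}{row}"] for row in range(start, end + 1))


def normalize(value) -> str:
    """Render numbers in a canonical form so that 6, 6.0 and '6' compare equal."""
    text = str(value).strip().lower()
    try:
        return f"{float(text):.6g}"
    except ValueError:
        return text


def formula_reward(formula: str, cells: dict, gold) -> float:
    """Binary reward: 1.0 if the executed formula matches the gold answer, else 0.0."""
    try:
        result = execute_sum(formula, cells)
    except (ValueError, KeyError):
        return 0.0  # formulas that fail to execute earn no reward
    return 1.0 if normalize(result) == normalize(gold) else 0.0
```

Because the reward inspects only the final answer, no gold formula is needed for training, which is the reduction in annotation reliance that the abstract points to.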
Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics
Abstract
arXiv:2505.23695v1 Announce Type: new Abstract: The rapid advancement of LLMs has led to the creation of diverse agentic systems in data analysis, utilizing LLMs' capabilities to improve insight generation and visualization. In this paper, we present an agentic system that automates the data-to-dashboard pipeline through modular LLM agents capable of domain detection, concept extraction, multi-perspective analysis generation, and iterative self-reflection. Unlike existing chart QA systems, our framework simulates the analytical reasoning process of business analysts by retrieving domain-relevant knowledge and adapting to diverse datasets without relying on closed ontologies or question templates. We evaluate our system on three datasets across different domains. Benchmarked against GPT-4o with a single-prompt baseline, our approach shows improved insightfulness, domain relevance, and analytical depth, as measured by tailored evaluation metrics and qualitative human assessment. This work contributes a novel modular pipeline to bridge the path from raw data to visualization, and opens new opportunities for human-in-the-loop validation by domain experts in business analytics. All code can be found here: https://github.com/77luvC/D2D_Data2Dashboard
摘要
大型语言模型(LLMs)的快速发展催生了数据分析领域多样化的智能体系统,这些系统利用LLMs的能力提升洞察生成与可视化水平。本文提出一种通过模块化LLM智能体实现数据到仪表盘全流程自动化的智能系统,其具备领域检测、概念提取、多视角分析生成及迭代自反思能力。与现有图表问答系统不同,本框架通过检索领域相关知识并适应多样化数据集,模拟了商业分析师的推理过程,且无需依赖封闭本体论或问题模板。我们在三个不同领域的数据集上评估系统性能。相较于采用单提示基准的GPT-4o,经定制化评估指标和人工定性测评,本方法在洞察深度、领域相关性和分析层次上均表现出显著提升。本研究贡献了一种连接原始数据与可视化呈现的新型模块化流程,为商业分析领域专家参与人机协同验证开辟了新途径。全部代码参见:https://github.com/77luvC/D2D_Data2Dashboard
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract
arXiv:2505.23762v1 Announce Type: new Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
摘要
大型视觉语言模型(VLM)的快速发展推动了纯视觉GUI代理的进步,这些代理能够感知并操作图形用户界面(GUI),以自主完成用户指令。然而,现有方法通常采用离线学习框架,存在两个核心局限:(1)高度依赖高质量人工标注进行元素定位和动作监督;(2)对动态交互环境的适应性有限。为解决这些问题,我们提出ZeroGUI——一种可扩展的在线学习框架,能够在零人工成本下自动化训练GUI代理。具体而言,ZeroGUI整合了:(i)基于VLM的自动任务生成,从当前环境状态产生多样化训练目标;(ii)基于VLM的自动奖励评估,无需人工设计评价函数即可判断任务成功与否;(iii)两阶段在线强化学习,持续与GUI环境交互并从中学习。在两个先进GUI代理(UI-TARS和Aguvis)上的实验表明,ZeroGUI在OSWorld和AndroidLab环境中显著提升了性能。代码已开源:https://github.com/OpenGVLab/ZeroGUI。
Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability
Abstract
arXiv:2505.23703v1 Announce Type: new Abstract: Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not presented in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce NL-FL HybridReasoning, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the HybridReasoning framework achieves 89.80% and 84.34% accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.
摘要
提升大语言模型(LLMs)的数学推理能力已成为数学与计算机科学领域的重要研究方向。近期工作通过基于纯强化学习(RL)方法在基础模型上的应用,在自然语言(NL)推理和形式语言(FL)推理方面取得显著进展。然而,RL方法难以赋予基础模型未展现的新能力,这凸显了将FL等知识有效整合至NL数学推理的必要性。但由于NL与FL在问题结构和推理形式上存在固有差异,这种整合具有挑战性。为此,我们提出NL-FL混合推理框架,该端到端框架旨在将FL专家系统融入NL数学问题求解过程。为弥合NL与FL输入格式差异,我们提出NL-FL问题对齐方法,将NL中的问答(QA)问题重构为FL中的存在性定理。随后,通过混合问题输入技术使FL推理器能同步处理QA与存在性证明问题。最后,我们采用基于LLM的答案提取机制来消除推理过程中NL与FL输出格式的差异。综合实验表明,混合推理框架在MATH-500和AMC基准测试中分别达到89.80%和84.34%的准确率,较NL基线模型提升4.60%和4.82%。值得注意的是,本框架解决的若干问题即便在增加尝试次数后仍无法被NL基线模型攻克。
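The NL-FL Problem Alignment step described above reformulates a question-answering problem as an existence theorem, so that a formal prover can search for a witness and the witness can then be read off as the answer. A small Lean 4 illustration of the idea follows. The problem itself is invented for this example and does not come from the paper or its benchmarks.

```lean
-- QA form (natural language): "Which natural number x satisfies 2 * x + 3 = 7?"
-- Existence form: some such x exists; the witness supplied in the proof is the answer.
example : ∃ x : Nat, 2 * x + 3 = 7 := ⟨2, rfl⟩
```

Here the witness `2` is what an answer-extraction step would recover from the proof term.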
In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge
Abstract
arXiv:2505.22767v1 Announce Type: cross Abstract: Large Language Models (LLMs) are typically analysed through architectural, behavioural, or training-data lenses. This article offers a theoretical and experiential re-framing: LLMs as dynamic instantiations of Collective human Knowledge (CK), where intelligence is evoked through dialogue rather than stored statically. Drawing on concepts from neuroscience and AI, and grounded in sustained interaction with ChatGPT-4, I examine emergent dialogue patterns, the implications of fine-tuning, and the notion of co-augmentation: mutual enhancement between human and machine cognition. This perspective offers a new lens for understanding interaction, representation, and agency in contemporary AI systems.
摘要
大型语言模型(LLMs)通常通过架构、行为或训练数据的视角进行分析。本文提出一种理论与经验相结合的重构框架:将LLMs视为人类集体知识(CK)的动态具现化,其智能通过对话被激发而非静态存储。借鉴神经科学与人工智能的概念,并基于与ChatGPT-4的持续交互实践,本研究探讨了对话模式的涌现现象、微调技术的深层影响以及协同增强机制——即人类认知与机器智能的相互提升。该视角为理解当代AI系统中的交互模式、知识表征与智能代理提供了新的理论透镜。
Pre-Training Curriculum for Multi-Token Prediction in Language Models
Abstract
arXiv:2505.22757v1 Announce Type: cross Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next k tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
摘要
多令牌预测(MTP)是近期提出的一种语言模型预训练目标。与仅预测下一个令牌(NTP)不同,MTP通过多个预测头在每个预测步骤中同时预测后续k个令牌。研究表明,MTP在提升下游任务性能、推理速度和训练效率方面具有潜力,尤其适用于大型模型。然而,已有工作发现较小规模语言模型(SLM)难以有效适应MTP目标。为此,我们提出一种MTP训练的课程学习策略,探索两种变体:正向课程(从NTP逐步过渡到MTP以增加目标复杂度)与逆向课程(采用相反路径)。实验表明,正向课程能使SLM在预训练阶段更好地利用MTP目标,既提升下游NTP性能与生成输出质量,又保留自推测解码的优势;而逆向课程虽能获得更强的NTP性能与输出质量,却无法提供任何自推测解码的收益。
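The forward and reverse curricula in the MTP abstract above can be pictured as schedules over the number of tokens predicted per step (called `k` here): the forward curriculum starts at next-token prediction (`k = 1`) and grows toward the full MTP objective, while the reverse curriculum runs the other way. The sketch below uses equal-length phases with a step-wise change in `k`. That phase layout, the function names, and the parameters are assumptions for illustration; the abstract does not specify the actual schedule.

```python
def forward_k(step: int, total_steps: int, max_k: int) -> int:
    """Tokens to predict at `step` under a forward (NTP -> MTP) curriculum.

    Training is split into `max_k` equal phases (an assumed layout);
    phase i, counted from 0, trains with i + 1 prediction heads.
    """
    if total_steps <= 0 or max_k < 1:
        raise ValueError("total_steps must be positive and max_k at least 1")
    step = min(max(step, 0), total_steps - 1)  # clamp out-of-range steps
    return 1 + (step * max_k) // total_steps


def reverse_k(step: int, total_steps: int, max_k: int) -> int:
    """Mirror image of the forward schedule: start at max_k, end at 1."""
    return max_k + 1 - forward_k(step, total_steps, max_k)
```

With `total_steps=100` and `max_k=4`, the forward schedule uses one head for steps 0 to 24, two for steps 25 to 49, and so on, ending with all four heads active, whereas the reverse schedule ends on plain next-token prediction.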
Training Language Models to Generate Quality Code with Program Analysis Feedback
Abstract
arXiv:2505.22704v1 Announce Type: cross Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
摘要
基于大语言模型(LLMs)的代码生成(常称为氛围编程)正逐渐被应用于实际生产,但其难以保障代码质量,尤其是在安全性(如SQL注入漏洞)和可维护性(如缺失类型注解)方面。现有方法如监督微调和基于规则的后处理依赖于人工密集型标注或脆弱的启发式规则,限制了其扩展性和有效性。我们提出REAL框架——一种通过程序分析引导反馈的强化学习范式,激励LLMs生成生产级质量代码。具体而言,REAL整合了两类自动化信号:(1)检测安全或可维护性缺陷的程序分析;(2)确保功能正确性的单元测试。与先前工作不同,本框架具备提示无关性和无参考性特征,可在无需人工干预的情况下实现可扩展的监督。跨多数据集和模型规模的实验表明,REAL在功能性与代码质量的同步评估中优于现有最优方法。本研究弥合了快速原型开发与生产就绪代码之间的鸿沟,使LLMs能够兼顾效率与质量。
A Tool for Generating Exceptional Behavior Tests With Large Language Models
Abstract
arXiv:2505.22818v1 Announce Type: cross Abstract: Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing "happy paths" (e.g., paths without unwanted events) over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their project (available at https://youtu.be/Jro8kMgplZk).
摘要
异常行为测试(EBT)在软件开发中至关重要,用于验证代码能否正确处理非预期事件并抛出适当异常。然而,先前研究表明开发者往往优先测试"正常路径"(即无异常事件的路径)而非异常场景。为此,我们提出exLong框架,通过自动生成EBT来解决这一缺陷。exLong基于CodeLlama微调的大语言模型(LLM),整合了异常抛出轨迹推理、守卫throw语句的条件表达式,以及执行相似轨迹的非异常行为测试。我们的演示视频展现了exLong如何有效协助开发者为其项目创建全面的EBT(详见https://youtu.be/Jro8kMgplZk)。
Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems
Abstract
arXiv:2505.22771v1 Announce Type: cross Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.
摘要
本研究阐述了在评分流程中引入以反馈为导向的标注如何提升自动作文评分(AES)的准确性。该方法通过"用于评分、筛选及理解议论文和语篇要素的说服性论文"(PERSUADE)语料库得到验证。我们整合了两种反馈驱动的标注类型:识别拼写和语法错误的标注,以及突出论证要素的标注。为展示该方法在实际场景中的应用,我们采用两种大语言模型生成标注——一个用于拼写校正的生成式语言模型,和一个经过训练能识别并标记论证要素的基于编码器的标记分类器。通过将标注纳入评分过程,我们证明了使用基于编码器的大语言模型作为微调分类器时性能的提升。
First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
Abstract
arXiv:2505.22809v1 Announce Type: cross Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.
摘要
关于直接协助人类用户完成任务对话式大语言模型(LLM)代理的研究已有大量成果。我们提出了一种与LLM代理交互的替代范式,称为"旁听代理"。这类代理并不主动参与对话,而是通过"监听"人与人之间的交谈来执行后台任务或提供辅助建议。本研究以《龙与地下城》游戏为场景探索旁听代理范式,采用大型多模态音频-语言模型作为辅助地下城主持人的旁听代理进行深入分析。通过人工评估检验此类代理的实用性,我们发现部分大型音频-语言模型具备利用隐含音频线索执行旁听代理任务的涌现能力。最后,我们在https://github.com/zhudotexe/overhearing_agents发布了支持旁听代理范式进一步研究的Python库及项目代码。